If you follow me, you know that this year I started a series called Weekly Digest for Data Science and AI: Python & R, where I highlighted the best libraries, repos, packages, and tools that help us be better data scientists for all kinds of tasks.
The great folks at Heartbeat sponsored a lot of these digests, and they asked me to create a list of the best of the best—those libraries that really changed or improved the way we worked this year (and beyond).
If you want to read the past digests, take a look here:
Disclaimer: This list is based on the libraries and packages I reviewed in my personal newsletter. All of them were trending in one way or another among programmers, data scientists, and AI enthusiasts. Some of them were created before 2018, but if they were trending, they could be considered.
AdaNet is a lightweight and scalable TensorFlow AutoML framework for training and deploying adaptive neural networks using the AdaNet algorithm [Cortes et al., ICML 2017]. AdaNet combines several learned subnetworks in order to mitigate the complexity inherent in designing effective neural networks.
This package will help you select optimal neural network architectures by implementing an adaptive algorithm that learns a neural architecture as an ensemble of subnetworks.
You will need to know TensorFlow to use the package because it implements a TensorFlow Estimator, but that abstraction also simplifies your machine learning code by encapsulating training, evaluation, prediction, and export for serving.
You can build an ensemble of neural networks, and the library will help you optimize an objective that balances the trade-offs between the ensemble’s performance on the training set and its ability to generalize to unseen data.
adanet depends on bug fixes and enhancements not present in TensorFlow releases prior to 1.7. You must install or upgrade your TensorFlow package to at least 1.7:
$ pip install "tensorflow>=1.7.0"
To install from source, you’ll first need to install bazel following their installation instructions.
Next, clone adanet and cd into its root directory:
$ git clone https://github.com/tensorflow/adanet && cd adanet
From the adanet root directory, run the tests:
$ cd adanet
$ bazel test -c opt //...
Once you have verified that everything works well, install adanet as a pip package.
You’re now ready to experiment with adanet:
import adanet
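The original post pointed to linked examples rather than inline code. As a rough, hedged sketch of what getting started could look like, here is the AutoEnsembleEstimator pattern on TensorFlow 1.x; the feature column, candidate estimators, and step counts below are illustrative assumptions, not taken from the post:
import adanet
import tensorflow as tf
# A single flattened-image numeric feature, just for illustration.
feature_columns = [tf.feature_column.numeric_column("x", shape=[28 * 28])]
# Let AdaNet learn how to ensemble a linear model and a small DNN.
estimator = adanet.AutoEnsembleEstimator(
    head=tf.contrib.estimator.multi_class_head(n_classes=10),
    candidate_pool=[
        tf.estimator.LinearClassifier(feature_columns=feature_columns),
        tf.estimator.DNNClassifier(feature_columns=feature_columns,
                                   hidden_units=[512, 256]),
    ],
    max_iteration_steps=1000)
# train_input_fn is assumed to yield ({"x": batch}, labels) pairs:
# estimator.train(input_fn=train_input_fn, steps=5000)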
Here you can find two examples of how to use the package:
You can read more about it in the original blog post:
Previously I talked about Auto-Keras, a great library for AutoML in the Pythonic world. Well, I have another very interesting tool for that.
The name is TPOT (Tree-based Pipeline Optimization Tool), and it’s an amazing library. It’s basically a Python automated machine learning tool that optimizes machine learning pipelines using genetic programming.
TPOT can automate a lot of stuff like feature selection, model selection, feature construction, and much more. Luckily, if you’re a Python machine learner, TPOT is built on top of Scikit-learn, so all of the code it generates should look familiar.
What it does is automate the most tedious parts of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data, and then it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.
This is how it works:
For more details, you can read these great articles by Matthew Mayo:
and Randy Olson:
You actually need to follow some instructions before installing TPOT. Here they are:
After that you can just run:
pip install tpot
First, let’s start with the basic Iris dataset:
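The notebook code was embedded in the original post; here is a minimal sketch of what that basic Iris pipeline could look like (the split sizes and TPOT settings are illustrative assumptions):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier
# Split the Iris data into training and testing sets.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, train_size=0.75, test_size=0.25, random_state=42)
# Let TPOT search for a good pipeline with genetic programming.
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
# Export the best pipeline TPOT found as a standalone Python script.
tpot.export('tpot_iris_pipeline.py')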
So here we built a very basic TPOT pipeline that searches for the best ML pipeline to predict iris.target, and then we export that pipeline. After that, it’s very simple: open the .py file that was generated and you’ll see:
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
train_test_split(features, tpot_data['class'], random_state=42)
exported_pipeline = make_pipeline(
RBFSampler(gamma=0.8500000000000001),
DecisionTreeClassifier(criterion="entropy", max_depth=3, min_samples_leaf=4, min_samples_split=9)
)
exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)
And that’s it. You built a classifier for the Iris dataset in a simple but powerful way.
Let’s go to the MNIST dataset now:
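Again, the code was embedded in the original post; here is a comparable hedged sketch, using scikit-learn’s small digits dataset as a stand-in for MNIST (dataset choice and settings are assumptions):
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier
# The scikit-learn digits dataset is a small MNIST-style dataset.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, train_size=0.75, test_size=0.25, random_state=42)
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_mnist_pipeline.py')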
As you can see, we did the same! Let’s load the .py file you generated again, and you’ll see:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
train_test_split(features, tpot_data['class'], random_state=42)
exported_pipeline = KNeighborsClassifier(n_neighbors=4, p=2, weights="distance")
exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)
Super easy and fun. Check it out, try it, and please give them a star!
Explaining machine learning models isn’t always easy. Yet it’s so important for a range of business applications. Luckily, there are some great libraries that help us with this task. In many applications, we need to know, understand, or prove how input variables are used in the model, and how they impact final model predictions.
SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations, uniting several previous methods and representing the only possible consistent and locally accurate additive feature attribution method based on expectations.
SHAP can be installed from PyPI:
pip install shap
or from conda-forge:
conda install -c conda-forge shap
There are tons of different models and ways to use the package. Here, I’ll take one example from the DeepExplainer.
Deep SHAP is a high-speed approximation algorithm for SHAP values in deep learning models that builds on a connection with DeepLIFT, as described in the SHAP NIPS paper that you can read here:
Here you can see how SHAP can be used to explain the result of a Keras model for the MNIST dataset:
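The notebook itself was embedded in the post; the sketch below is a hedged approximation of that workflow, with a deliberately small Keras model standing in for the one used in the original:
import numpy as np
import shap
from tensorflow import keras
# Train a small MNIST classifier (a quick stand-in for the post's model).
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train[..., np.newaxis] / 255.0
x_test = x_test[..., np.newaxis] / 255.0
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28, 1)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train, epochs=1, batch_size=128)
# Explain predictions with Deep SHAP, using a random sample of the
# training set as the background distribution.
background = x_train[np.random.choice(x_train.shape[0], 100, replace=False)]
explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(x_test[:5])
# Plot the per-class attributions for the first few test images.
shap.image_plot(shap_values, -x_test[:5])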
You can find more examples here:
Take a look. You’ll be surprised :)
Ok, so full disclosure, this library is like my baby. I’ve been working on it for a long time now, and I’m very happy to show you version 2.
Optimus V2 was created to make data cleaning a breeze. The API was designed to be super easy for newcomers and very familiar for people that come from working with pandas. Optimus expands the Spark DataFrame functionality, adding .rows and .cols attributes.
With Optimus you can clean your data, prepare it, analyze it, create profilers and plots, and perform machine learning and deep learning, all in a distributed fashion, because on the back-end we have Spark, TensorFlow, and Keras.
It’s super easy to use. It’s like the evolution of pandas, with a piece of dplyr, joined by Keras and Spark. The code you create with Optimus will work on your local machine, and with a simple change of the Spark master, it can run on your local cluster or in the cloud.
You will see a lot of interesting functions created to help with every step of the data science cycle.
Optimus is perfect as a companion for an agile methodology for data science because it can help you in almost all the steps of the process, and it can easily connect to other libraries and tools.
If you want to read more about an Agile DS Methodology check this out:
pip install optimuspyspark
As one example, you can load data from a url, transform it, and apply some predefined cleaning functions:
from optimus import Optimus
op = Optimus()
# This is a custom function
def func(value, arg):
    return "this was a number"
df = op.load.url("https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/foo.csv")
df\
.rows.sort("product","desc")\
.cols.lower(["firstName","lastName"])\
.cols.date_transform("birth", "new_date", "yyyy/MM/dd", "dd-MM-YYYY")\
.cols.years_between("birth", "years_between", "yyyy/MM/dd")\
.cols.remove_accents("lastName")\
.cols.remove_special_chars("lastName")\
.cols.replace("product","taaaccoo","taco")\
.cols.replace("product",["piza","pizzza"],"pizza")\
.rows.drop(df["id"]<7)\
.cols.drop("dummyCol")\
.cols.rename(str.lower)\
.cols.apply_by_dtypes("product",func,"string", data_type="integer")\
.cols.trim("*")\
.show()
You can transform this:
into this:
Pretty cool, right?
You can do a thousand more things with the library, so please check it out:
spaCy is designed to help you do real work — to build real products, or gather real insights. The library respects your time, and tries to avoid wasting it. It’s easy to install, and its API is simple and productive. We like to think of spaCy as the Ruby on Rails of Natural Language Processing.
spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, Scikit-learn, Gensim, and the rest of Python’s awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.
pip3 install spacy
$ python3 -m spacy download en
Here, we’re also downloading the English language model. You can find models for German, Spanish, Italian, Portuguese, French, and more here:
Here’s an example from the main webpage:
# python -m spacy download en_core_web_sm
import spacy
# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en_core_web_sm')
# Process whole documents
text = (u"When Sebastian Thrun started working on self-driving cars at "
u"Google in 2007, few people outside of the company took him "
u"seriously. “I can tell you very senior CEOs of major American "
u"car companies would shake my hand and turn away because I wasn’t "
u"worth talking to,” said Thrun, now the co-founder and CEO of "
u"online higher education startup Udacity, in an interview with "
u"Recode earlier this week.")
doc = nlp(text)
# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)
# Determine semantic similarities
doc1 = nlp(u"my fries were super gross")
doc2 = nlp(u"such disgusting fries")
similarity = doc1.similarity(doc2)
print(doc1.text, doc2.text, similarity)
In this example, we first download the English tokenizer, tagger, parser, NER, and word vectors. Then we create some text, and finally we print the entities, phrases, and concepts found, and then we determine the semantic similarity of the two phrases. If you run this code you get this:
Sebastian Thrun PERSON
Google ORG
2007 DATE
American NORP
Thrun PERSON
Recode ORG
earlier this week DATE
my fries were super gross such disgusting fries 0.7139701635071919
Very simple and super useful. There is also a spaCy Universe, where you can find great resources developed with or for spaCy. It includes standalone packages, plugins, extensions, educational materials, operational utilities, and bindings for other languages:
By the way, the usage page is great, with very good explanations and code:
Take a look at the visualizers page. Awesome features, here:
For me, this is one of the packages of the year. It’s such an important part of what we do as data scientists. Almost all of us work in notebooks like Jupyter, but we also use IDEs like PyCharm for more hardcore parts of our projects.
The good news is that plain scripts, which you can draft and test in your favorite IDE, open transparently as notebooks in Jupyter when using Jupytext. Run the notebook in Jupyter to generate the outputs, associate an .ipynb representation, and save and share your research as either a plain script or as a traditional Jupyter notebook with outputs.
You can see a workflow of what you can do with the package in the gif below:
Install Jupytext with:
pip install jupytext --upgrade
Then, configure Jupyter to use Jupytext. Generate a Jupyter config, if you don’t have one yet, with:
jupyter notebook --generate-config
Edit .jupyter/jupyter_notebook_config.py and append the following:
c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager"
Then restart Jupyter:
jupyter notebook
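Jupytext also ships a command-line converter. As a small, hedged example (the file names are placeholders, not from the original post), you can round-trip between a script and a notebook like this:
jupytext --to notebook notebook.py
jupytext --to py notebook.ipynb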
You can give it a try here:
This, for me, is the winner of the year, for Python. If you are in the Python world, most likely you waste a lot of your time trying to create a decent plot. Luckily, we have libraries like Seaborn that make our life easier. But the issue is that their plots are not dynamic.
Then you have Bokeh—an amazing library—but creating interactive plots with it can be a pain in the a**. If you want to know more about Bokeh and interactive plots for Data Science, take a look at these great articles by William Koehrsen :
Chartify is built on top of Bokeh, but it’s also so much simpler.
From the authors:
1. Install chartify:
pip3 install chartify
2. Install the chromedriver requirement (optional; only needed for PNG output). Check the directories in your PATH and copy the chromedriver executable into one of them:
echo $PATH
cp chromedriver /usr/local/bin
Let’s say we want to create this chart:
import pandas as pd
import chartify
# Generate example data
data = chartify.examples.example_data()
Now that we have some example data loaded, let’s do some transformations:
total_quantity_by_month_and_fruit = (data.groupby(
[data['date'] + pd.offsets.MonthBegin(-1), 'fruit'])['quantity'].sum()
.reset_index().rename(columns={'date': 'month'})
.sort_values('month'))
print(total_quantity_by_month_and_fruit.head())
month fruit quantity
0 2017-01-01 Apple 7
1 2017-01-01 Banana 6
2 2017-01-01 Grape 1
3 2017-01-01 Orange 2
4 2017-02-01 Apple 8
And now we can plot it:
# Plot the data
ch = chartify.Chart(blank_labels=True, x_axis_type='datetime')
ch.set_title("Stacked area")
ch.set_subtitle("Represent changes in distribution.")
ch.plot.area(
data_frame=total_quantity_by_month_and_fruit,
x_column='month',
y_column='quantity',
color_column='fruit',
stacked=True)
ch.show('png')
Super easy to create a plot, and it’s interactive. If you want more examples to create stuff like this:
And more, check the original repo:
Inference, or statistical inference, is the process of using data analysis to deduce properties of an underlying probability distribution.
The objective of this package is to perform statistical inference using an expressive statistical grammar that coheres with the tidyverse design framework.
To install the current stable version of infer from CRAN:
install.packages("infer")
Let’s try a simple example on the mtcars dataset to see what the library can do for us.
First, let’s overwrite mtcars so that the variables cyl, vs, am, gear, and carb are factors.
library(infer)
library(dplyr)
mtcars <- mtcars %>%
mutate(cyl = factor(cyl),
vs = factor(vs),
am = factor(am),
gear = factor(gear),
carb = factor(carb))
# For reproducibility
set.seed(2018)
We’ll try hypothesis testing. Here, a hypothesis is proposed so that it’s testable on the basis of observing a process that’s modeled via a set of random variables. Normally, two statistical data sets are compared, or a data set obtained by sampling is compared against a synthetic data set from an idealized model.
mtcars %>%
  specify(response = mpg) %>% # formula alt: mpg ~ NULL
  hypothesize(null = "point", med = 26) %>%
  generate(reps = 100, type = "bootstrap") %>%
  calculate(stat = "median")
Here, we first specify the response and explanatory variables, then we declare a null hypothesis. After that, we generate resamples using bootstrap and finally calculate the median. The result of that code is:
## # A tibble: 100 x 2
## replicate stat
## <int> <dbl>
## 1 1 26.6
## 2 2 25.1
## 3 3 25.2
## 4 4 24.7
## 5 5 24.6
## 6 6 25.8
## 7 7 24.7
## 8 8 25.6
## 9 9 25.0
## 10 10 25.1
## # ... with 90 more rows
One of the greatest parts of this library is the visualize function. This will allow you to visualize the distribution of the simulation-based inferential statistics or the theoretical distribution (or both). For an example, let’s use the flights data set. First, let’s do some data preparation:
library(nycflights13)
library(dplyr)
library(ggplot2)
library(stringr)
library(infer)
set.seed(2017)
fli_small <- flights %>%
  na.omit() %>%
  sample_n(size = 500) %>%
  mutate(season = case_when(
    month %in% c(10:12, 1:3) ~ "winter",
    month %in% c(4:9) ~ "summer"
  )) %>%
  mutate(day_hour = case_when(
    between(hour, 1, 12) ~ "morning",
    between(hour, 13, 24) ~ "not morning"
  )) %>%
  select(arr_delay, dep_delay, season, day_hour, origin, carrier)
And now we can run a randomization approach to the χ² statistic:
chisq_null_distn <- fli_small %>%
  specify(origin ~ season) %>% # alt: response = origin, explanatory = season
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "Chisq")
# obs_chisq is the observed chi-squared statistic, computed from the sample beforehand
chisq_null_distn %>%
  visualize(obs_stat = obs_chisq, direction = "greater")
Data cleansing is a topic very close to me. I’ve been working with my team at Iron-AI to create a tool for Python called Optimus. You can see more about it here:
But this tool I’m showing you is a very cool package with simple functions for data cleaning.
It has three main functions: perfectly formatting data.frame column names; creating frequency tables of one, two, or three variables, think of it as an improved table(); and isolating partially duplicate records.
Oh, and it’s a tidyverse-oriented package. Specifically, it works nicely with the %>% pipe and is optimized for cleaning data brought in with the readr and readxl packages.
install.packages("janitor")
I’m using the example from the repo and the dirty_data.xlsx data file.
library(pacman) # for loading packages
p_load(readxl, janitor, dplyr, here)
roster_raw <- read_excel(here("dirty_data.xlsx")) # available at http://github.com/sfirke/janitor
glimpse(roster_raw)
#> Observations: 13
#> Variables: 11
#> $ `First Name` <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Shiung", "Chien-Shiung", N...
#> $ `Last Name` <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu", "Wu", NA, "Joyce", "Lam...
#> $ `Employee Status` <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Administration", "Teacher", "Tea...
#> $ Subject <chr> "PE", "Drafting", "Music", NA, "Dean", "Physics", "Chemistry", NA, "English",...
#> $ `Hire Date` <dbl> 39690, 39690, 37118, 27515, 41431, 11037, 11037, NA, 32994, 27919, 42221, 347...
#> $ `% Allocated` <dbl> 0.75, 0.25, 1.00, 1.00, 1.00, 0.50, 0.50, NA, 0.50, 0.50, NA, NA, 0.80
#> $ `Full time?` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA, "No", "No", "No", "No", ...
#> $ `do not edit! --->` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
#> $ Certification <chr> "Physical ed", "Physical ed", "Instr. music", "PENDING", "PENDING", "Science ...
#> $ Certification__1 <chr> "Theater", "Theater", "Vocal music", "Computers", NA, "Physics", "Physics", N...
#> $ Certification__2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
With this:
roster <- roster_raw %>%
clean_names() %>%
remove_empty(c("rows", "cols")) %>%
mutate(hire_date = excel_numeric_to_date(hire_date),
cert = coalesce(certification, certification_1)) %>% # from dplyr
select(-certification, -certification_1) # drop unwanted columns
With the clean_names() function, we clean up the messy column names. Then we remove the empty rows and columns, and, using dplyr, we convert the Excel dates, create a new cert column that combines the information from certification and certification_1, and then drop those two columns.
And with this piece of code…
roster %>% get_dupes(first_name, last_name)
we can find duplicated records that have the same name and last name.
The package also introduces the tabyl() function, which tabulates the data like table() but is pipe-able, data.frame-based, and fully featured. For example:
roster %>%
tabyl(subject)
#> subject n percent valid_percent
#> Basketball 1 0.08333333 0.1
#> Chemistry 1 0.08333333 0.1
#> Dean 1 0.08333333 0.1
#> Drafting 1 0.08333333 0.1
#> English 2 0.16666667 0.2
#> Music 1 0.08333333 0.1
#> PE 1 0.08333333 0.1
#> Physics 1 0.08333333 0.1
#> Science 1 0.08333333 0.1
#> <NA> 2 0.16666667 NA
You can do a lot more things with the package, so visit their site and give them some love :)
This add-in allows you to interactively explore your data by visualizing it with the ggplot2 package. It allows you to draw bar graphs, curves, scatter plots, and histograms, and then export the graph or retrieve the code generating the graph.
Install from CRAN with:
# From CRAN
install.packages("esquisse")
The add-in appears as “ggplot2 builder” in the RStudio Addins menu. If you don’t have a data.frame in your environment, the datasets from ggplot2 are used.
Launch the add-in via the RStudio menu or with:
esquisse::esquisser()
The first step is to choose a data.frame:
Or you can use a dataset directly with:
esquisse::esquisser(data = iris)
After that, you can drag and drop variables to create a plot:
You can find information about the package and sub-menus in the original repo:
Exploratory Data Analysis (EDA) is an initial and important phase of data analysis and predictive modeling. During this process, analysts and modelers get a first look at the data, generate relevant hypotheses, and decide on next steps. However, the EDA process can be a hassle at times. This R package aims to automate most of the data handling and visualization so that users can focus on studying the data and extracting insights.
The package can be installed directly from CRAN.
install.packages("DataExplorer")
With the package you can create reports, plots, and tables like this:
## Plot basic description for airquality data
plot_intro(airquality)
## View missing value distribution for airquality data
plot_missing(airquality)
## Left: frequency distribution of all discrete variables
plot_bar(diamonds)
## Right: `price` distribution of all discrete variables
plot_bar(diamonds, with = "price")
## View histogram of all continuous variables
plot_histogram(diamonds)
You can find much more like this on the package’s official webpage:
And in this vignette:
Sparklyr lets you connect to Spark from R, use dplyr to filter and aggregate Spark datasets and then bring them into R for analysis and visualization, and access Spark’s distributed machine learning library directly from R.
You can install the Sparklyr package from CRAN as follows:
install.packages("sparklyr")
You should also install a local version of Spark for development purposes:
library(sparklyr)
spark_install(version = "2.3.1")
The first part of using Spark is always creating a context and connecting to a local or remote cluster.
Here we’ll connect to a local instance of Spark via the spark_connect function:
library(sparklyr)
sc <- spark_connect(master = "local")
We’ll start by copying some datasets from R into the Spark cluster (note that you may need to install the nycflights13 and Lahman packages in order to execute this code):
install.packages(c("nycflights13", "Lahman"))
library(dplyr)
iris_tbl <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
src_tbls(sc)
## [1] "batting" "flights" "iris"
To start with, here’s a simple filtering example:
# filter by departure delay and print the first few records
flights_tbl %>% filter(dep_delay == 2)
## # Source: lazy query [?? x 19]
## # Database: spark_connection
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 542 540 2 923
## 3 2013 1 1 702 700 2 1058
## 4 2013 1 1 715 713 2 911
## 5 2013 1 1 752 750 2 1025
## 6 2013 1 1 917 915 2 1206
## 7 2013 1 1 932 930 2 1219
## 8 2013 1 1 1028 1026 2 1350
## 9 2013 1 1 1042 1040 2 1325
## 10 2013 1 1 1231 1229 2 1523
## # ... with more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Let’s plot the data on flight delays:
delay <- flights_tbl %>%
group_by(tailnum) %>%
summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
filter(count > 20, dist < 2000, !is.na(delay)) %>%
collect
# plot delays
library(ggplot2)
ggplot(delay, aes(dist, delay)) +
geom_point(aes(size = count), alpha = 1/2) +
geom_smooth() +
scale_size_area(max_size = 2)
## `geom_smooth()` using method = 'gam'
You can orchestrate machine learning algorithms in a Spark cluster via the machine learning functions within Sparklyr. These functions connect to a set of high-level APIs built on top of DataFrames that help you create and tune machine learning workflows.
Here’s an example where we use ml_linear_regression to fit a linear regression model. We’ll use the built-in mtcars dataset to see if we can predict a car’s fuel consumption (mpg) based on its weight (wt) and the number of cylinders the engine contains (cyl). We’ll assume in each case that the relationship between mpg and each of our features is linear.
# copy mtcars into spark
mtcars_tbl <- copy_to(sc, mtcars)
# transform our data set, and then partition into 'training', 'test'
partitions <- mtcars_tbl %>%
filter(hp >= 100) %>%
mutate(cyl8 = cyl == 8) %>%
sdf_partition(training = 0.5, test = 0.5, seed = 1099)
# fit a linear model to the training dataset
fit <- partitions$training %>%
ml_linear_regression(response = "mpg", features = c("wt", "cyl"))
fit
## Call: ml_linear_regression.tbl_spark(., response = "mpg", features = c("wt", "cyl"))
##
## Formula: mpg ~ wt + cyl
##
## Coefficients:
## (Intercept) wt cyl
## 33.499452 -2.818463 -0.923187
For linear regression models produced by Spark, we can use summary() to learn a bit more about the quality of our fit and the statistical significance of each of our predictors.
summary(fit)
## Call: ml_linear_regression.tbl_spark(., response = "mpg", features = c("wt", "cyl"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.752 -1.134 -0.499 1.296 2.282
##
## Coefficients:
## (Intercept) wt cyl
## 33.499452 -2.818463 -0.923187
##
## R-Squared: 0.8274
## Root Mean Squared Error: 1.422
Spark machine learning supports a wide array of algorithms and feature transformations, and as illustrated above, it’s easy to chain these functions together with dplyr pipelines.
Check out more about machine learning with sparklyr here:
And more information in general about the package and examples here:
Nope, just kidding. But the name of the package is drake!
This is such an amazing package. I’ll create a separate post with more details about it, so wait for that!
Drake is a package created as a general-purpose workflow manager for data-driven tasks. It rebuilds intermediate data objects when their dependencies change, and it skips work when the results are already up to date.
Also, not every run-through starts from scratch, and completed workflows have tangible evidence of reproducibility.
Reproducibility, good management, and tracking experiments are all necessary for easily testing others’ work and analysis. It’s a huge deal in Data Science, and you can read more about it here:
From Zach Scott :
And in an article by me :)
With drake, you can automatically rebuild only the parts of your workflow whose dependencies changed since the last run, and skip everything that’s already up to date.
# Install the latest stable release from CRAN.
install.packages("drake")
# Alternatively, install the development version from GitHub.
install.packages("devtools")
library(devtools)
install_github("ropensci/drake")
There are some known errors when installing from CRAN. For more on these errors, visit:
I ran into one of these errors myself, so for now I recommend installing the package from GitHub.
Ok, so let’s reproduce a simple example with a twist:
I added a simple plot to visualize the linear model within drake’s main example. With this code, you’re creating a plan for executing your whole project.
First, we read the data. Then we prepare it for analysis, create a simple histogram, calculate the correlation, fit the model, plot the linear model, and finally create an R Markdown report.
The code I used for the final report is here:
If we change some of our functions or analysis, when we execute the plan, drake will know what has changed and will only run those changes. It creates a graph so you can see what’s happening:
In RStudio, this graph is interactive, and you can save it to HTML for later analysis.
There are more awesome things you can do with drake that I’ll show in a future post :)
Explaining machine learning models isn’t always easy. Yet it’s so important for a range of business applications. Luckily, there are some great libraries that help us with this task. For example:
(By the way, sometimes a simple visualization with ggplot can help you explain a model. For more on this, check out the awesome article below by Matthew Mayo.)
In many applications, we need to know, understand, or prove how input variables are used in the model, and how they impact final model predictions.
DALEX is a set of tools that helps explain how complex models are working.
To install from CRAN, just run:
install.packages("DALEX")
They have amazing documentation on how to use DALEX with different ML packages:
Great cheat sheets:
Here’s an interactive notebook where you can learn more about the package:
And finally, some book-style documentation on DALEX, machine learning, and explainability:
Check it out in the original repository: